Red Wine Exploration by Piranut Lapprathana

Univariate Plots

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The mean quality score is about 5.6 (median 6). The median density is 0.997 g/cm^3. 75% of wines have a pH of 3.4 or lower (most wines are between 3 and 4 on the pH scale).

Exploratory, quick histogram plots

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Distribution of wine quality

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Distribution of residual.sugar

## Warning: Removed 11 rows containing non-finite values (stat_bin).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Distribution of citric acid

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
## 
##    0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09  0.1 0.11 0.12 0.13 0.14 
##  132   33   50   30   29   20   24   22   33   30   35   15   27   18   21 
## 0.15 0.16 0.17 0.18 0.19  0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 
##   19    9   16   22   21   25   33   27   25   51   27   38   20   19   21 
##  0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39  0.4 0.41 0.42 0.43 0.44 
##   30   30   32   25   24   13   20   19   14   28   29   16   29   15   23 
## 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 
##   22   19   18   23   68   20   13   17   14   13   12    8    9    9    8 
##  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69  0.7 0.71 0.72 0.73 0.74 
##    9    2    1   10    9    7   14    2   11    4    2    1    1    3    4 
## 0.75 0.76 0.78 0.79    1 
##    1    3    1    1    1

Distribution of volatile acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

No wine has a volatile acidity of 0; the minimum is 0.12 and the maximum is 1.58.

## 
##  0.12  0.16  0.18  0.19   0.2  0.21  0.22  0.23  0.24  0.25  0.26  0.27 
##     3     2    10     2     3     6     6     5    13     7    16    14 
##  0.28  0.29 0.295   0.3 0.305  0.31 0.315  0.32  0.33  0.34  0.35  0.36 
##    23    16     1    16     2    30     2    23    20    30    22    38 
## 0.365  0.37  0.38  0.39 0.395   0.4  0.41 0.415  0.42  0.43  0.44  0.45 
##     2    24    35    35     2    37    33     3    31    43    23    22 
##  0.46  0.47 0.475  0.48  0.49   0.5  0.51  0.52  0.53  0.54 0.545  0.55 
##    31    21     2    24    35    46    24    33    29    31     5    20 
##  0.56 0.565  0.57 0.575  0.58 0.585  0.59 0.595   0.6 0.605  0.61 0.615 
##    34     1    28     3    38     3    39     1    47     3    27     6 
##  0.62 0.625  0.63 0.635  0.64 0.645  0.65 0.655  0.66 0.665  0.67 0.675 
##    24     3    29     9    27    12    16     7    26     3    23     3 
##  0.68 0.685  0.69 0.695   0.7 0.705  0.71 0.715  0.72 0.725  0.73 0.735 
##    12    11    23     7    10     6     3    12     5     9     6     8 
##  0.74 0.745  0.75 0.755  0.76 0.765  0.77 0.775  0.78 0.785  0.79 0.795 
##    11     5     6     3     5     5     6     4    10     8     2     2 
##   0.8 0.805  0.81 0.815  0.82 0.825  0.83 0.835  0.84 0.845  0.85 0.855 
##     3     1     2     3     5     1     4     4     8     1     2     3 
##  0.86 0.865  0.87 0.875  0.88 0.885  0.89 0.895   0.9  0.91 0.915  0.92 
##     2     1     4     2     5     5     1     1     3     3     4     1 
## 0.935  0.95 0.955  0.96 0.965 0.975  0.98     1 1.005  1.01  1.02 1.025 
##     2     1     1     3     3     1     3     3     1     1     4     1 
## 1.035  1.04  1.07  1.09 1.115  1.13  1.18 1.185  1.24  1.33  1.58 
##     1     3     1     1     1     1     1     1     1     2     1

Density of wine

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Categorizing wines into ratings based on quality: ‘bad’ (0-4), ‘average’ (5-6), ‘good’ (7-10).

##     bad average    good 
##      63    1319     217
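
A rating like this can be built with `cut()`. The sketch below is a minimal illustration rather than the original code: the short `quality` vector is made up, and the break points are assumed from the bins described above.

```r
# Hypothetical quality scores standing in for rw$quality
quality <- c(3, 4, 5, 5, 6, 6, 7, 8)

# Bins: 0-4 -> bad, 5-6 -> average, 7-10 -> good
# include.lowest = TRUE keeps a score of 0 inside the first bin
rating <- cut(quality,
              breaks = c(0, 4, 6, 10),
              labels = c("bad", "average", "good"),
              include.lowest = TRUE)

# Count of wines in each rating
table(rating)
```

Because `cut()` returns a factor with the levels in the order given, tables and plots keep the bad, average, good ordering.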

Combining all acids (fixed, volatile (acetic acid), and citric acid)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.270   7.827   8.720   9.118  10.070  17.050
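
As a rough sketch of how such a combined variable could be computed, the example below adds the three acid columns row by row. The data frame name `rw_demo` is hypothetical; its three rows are taken from the first values in the `str()` output above, and a simple sum is assumed.

```r
# Three rows copied from the str() output above (rows 1, 2 and 4)
rw_demo <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 11.2),
  volatile.acidity = c(0.70, 0.88, 0.28),
  citric.acid      = c(0.00, 0.00, 0.56)
)

# Sum the three acid measurements for each wine
rw_demo$total.acidity <- with(rw_demo,
  fixed.acidity + volatile.acidity + citric.acid)

summary(rw_demo$total.acidity)
```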

Univariate Analysis

Structure of dataset:

There are 1599 wines in the dataset with 12 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality). The thirteenth column, X, is a row index.

Quality is scored from 0 (worst) to 10 (best).

The mean quality score is about 5.6 (median 6), with values ranging from 3 to 8. The median density is 0.997 g/cm^3. 75% of wines have a pH of 3.4 or lower (most wines are between 3 and 4 on the pH scale).

Main feature(s) of interest:

Volatile.acidity and residual sugar are interesting features to explore. I want to find out how these two features correlate with the quality of wine. Other features like chlorides and density may help explain the quality of wine.

Other features that may support the investigation into the feature(s) of interest:

chlorides, density, pH and alcohol

New variables created:

I created a ‘rating’ variable which classifies wines into ‘bad’,‘average’ and ‘good’. I also created a variable called ‘total.acidity’ to hold the sum of all acids (fixed, volatile, and citric).

Operations on the data to tidy, adjust, or change the form of the data:

I log-transformed (base 10) residual sugar and volatile acidity.
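
To illustrate why a log transform helps with a right-skewed variable, the sketch below applies `log10()` to the five-number summary of residual.sugar reported earlier. It is only an illustration on five values, not the original transformation code.

```r
# Min, 1st quartile, median, 3rd quartile and max of residual.sugar (g/dm^3)
residual.sugar <- c(0.9, 1.9, 2.2, 2.6, 15.5)

log10.residual.sugar <- log10(residual.sugar)

# Distance from the median to the maximum, before and after the transform
max(residual.sugar) - median(residual.sugar)
max(log10.residual.sugar) - median(log10.residual.sugar)
```

On the original scale the maximum sits far from the median compared with the quartiles; on the log10 scale that gap is much smaller, which pulls the long right tail in.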

Bivariate Plots

Correlation between alcohol and density

## 
##  Pearson's product-moment correlation
## 
## data:  rw$alcohol and rw$density
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5322547 -0.4583061
## sample estimates:
##        cor 
## -0.4961798

Quality and volatile.acidity

## rw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## rw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## rw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## rw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## rw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## rw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Higher quality wines have lower volatile acidity (high levels of volatile acidity can lead to an unpleasant, vinegar-like taste).
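
Per-group summaries like the ones above can be produced with `by()` or `tapply()`. The following is a minimal sketch on made-up values; the data frame `demo` is hypothetical and only mimics the shape of the quality and volatile.acidity columns.

```r
# Hypothetical data standing in for rw$quality and rw$volatile.acidity
demo <- data.frame(
  quality          = c(3, 3, 5, 5, 5, 7, 7, 8),
  volatile.acidity = c(0.88, 1.01, 0.58, 0.46, 0.67, 0.37, 0.30, 0.34)
)

# Median volatile acidity within each quality score
medians <- tapply(demo$volatile.acidity, demo$quality, median)
medians

# Full summary (min, quartiles, mean, max) for each quality score
by(demo$volatile.acidity, demo$quality, summary)
```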

Quality and alcohol

## rw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.580  11.000 
## -------------------------------------------------------- 
## rw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## rw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## rw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## rw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## rw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Higher quality wines tend to have a higher percentage of alcohol. Quality 8 has the highest median alcohol content (12.15).

Generating box plots for each variable against quality

Calculating correlations against quality for each variable

##        fixed.acidity     volatile.acidity          citric.acid 
##           0.12405165          -0.39055778           0.22637251 
##        total.acidity log10.residual.sugar      log10.chlordies 
##           0.10375373           0.02353331          -0.17613996 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##          -0.05065606          -0.18510029          -0.17491923 
##                   pH      log10.sulphates              alcohol 
##          -0.05773139           0.30864193           0.47616632
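
One way to compute a table like this is to apply `cor()` to each predictor column in turn. The sketch below assumes that approach and uses a small made-up data frame (`demo`), so the numbers it prints are not the ones reported above.

```r
# Hypothetical data; the real analysis uses the full wine data frame
demo <- data.frame(
  alcohol          = c(9.4, 9.8, 10.5, 11.5, 12.2, 12.8),
  volatile.acidity = c(0.88, 0.70, 0.58, 0.49, 0.37, 0.34),
  quality          = c(3, 5, 5, 6, 7, 8)
)

# Every column except the response
predictors <- setdiff(names(demo), "quality")

# Pearson correlation of each predictor with quality
cors <- sapply(demo[predictors], function(x) cor(x, demo$quality))

# Order by absolute size so the strongest relationships come first
cors[order(abs(cors), decreasing = TRUE)]
```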

Correlation

## 
##  Pearson's product-moment correlation
## 
## data:  rw$residual.sugar and rw$density
## t = 15.189, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3116908 0.3973835
## sample estimates:
##       cor 
## 0.3552834
## 
##  Pearson's product-moment correlation
## 
## data:  rw$citric.acid and rw$pH
## t = -25.767, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5756337 -0.5063336
## sample estimates:
##        cor 
## -0.5419041
## 
##  Pearson's product-moment correlation
## 
## data:  rw$fixed.acidity and rw$pH
## t = -37.366, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7082857 -0.6559174
## sample estimates:
##        cor 
## -0.6829782

Correlations between variables faceted by rating

## 
##  Pearson's product-moment correlation
## 
## data:  rw$fixed.acidity and rw$citric.acid
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6438839 0.6977493
## sample estimates:
##       cor 
## 0.6717034

## 
##  Pearson's product-moment correlation
## 
## data:  rw$volatile.acidity and rw$citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5856550 -0.5174902
## sample estimates:
##        cor 
## -0.5524957

## 
##  Pearson's product-moment correlation
## 
## data:  log10(rw$total.acidity) and rw$pH
## t = -39.663, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7283140 -0.6788653
## sample estimates:
##        cor 
## -0.7044435

There is a strong negative correlation (-0.704) between log10 total acidity and pH, as expected (higher acidity means a lower pH value).
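
The tests above have the form of `cor.test()` output. As a hedged sketch, the example below runs the same kind of test on made-up values; `total.acidity` and `pH` here are hypothetical vectors, not the wine data. Since pH is itself a negative base-10 logarithm of hydrogen ion activity, taking log10 of the acidity is a natural choice for a linear comparison.

```r
# Hypothetical values standing in for rw$total.acidity and rw$pH
total.acidity <- c(5.3, 7.8, 8.7, 9.1, 10.1, 12.0, 17.0)
pH            <- c(3.90, 3.45, 3.35, 3.30, 3.20, 3.05, 2.80)

result <- cor.test(log10(total.acidity), pH, method = "pearson")

result$estimate   # sample correlation coefficient
result$conf.int   # 95 percent confidence interval
result$p.value    # p-value for the null hypothesis of zero correlation
```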

Bivariate Analysis

Relationships observed between feature(s) of interest and other features in the dataset:

Quality correlates most strongly with alcohol (0.476) and volatile acidity (-0.391), although both correlations are moderate. As wine quality increases, volatile acidity decreases (high levels of volatile acidity can lead to an unpleasant, vinegar-like taste). As wine quality increases, the percentage of alcohol increases (quality 8 has the highest median alcohol content). The highest quality wines have the lowest median density and the lowest quality wines have the highest median density.

Most wines have residual sugar between 1 and 5 g/dm^3, but there are some outliers for qualities 5 and 6.

Interesting relationships between the other features:

The following variables have relatively higher correlations with quality: alcohol (0.476), volatile.acidity (-0.391), log10.sulphates (0.309) and citric.acid (0.226).

There is a moderate negative correlation (-0.496) between alcohol and density, and a weak positive correlation (0.355) between residual sugar and density. There is a moderate negative correlation (-0.542) between citric acid and pH, as expected, since higher acidity gives a lower value on the pH scale. For the same reason there is a strong negative correlation between fixed acidity and pH (-0.683) and between log10 total acidity and pH (-0.704).

Strongest relationships:

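Quality correlates most with alcohol and volatile acidity. Among the other features, the strongest relationships are the negative correlation between pH and log10 total acidity (-0.704), the positive correlation between citric acid and fixed acidity (0.672), and the negative correlation between citric acid and volatile acidity (-0.552).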

Multivariate Plots

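Examining four variables: alcohol, volatile acidity, pH and sulphates. Alcohol, volatile acidity and sulphates have the highest correlations with quality; pH is only weakly correlated with it (-0.058).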

Relationship between citric acid and volatile acidity by quality

## 
##  Pearson's product-moment correlation
## 
## data:  rw$free.sulfur.dioxide and rw$total.sulfur.dioxide
## t = 35.84, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6395786 0.6939740
## sample estimates:
##       cor 
## 0.6676665

There is a strong positive correlation (0.67) between free sulfur dioxide and total sulfur dioxide.

Multivariate Analysis

Relationships observed:

Four features (citric acid, volatile acidity, pH and alcohol) are examined. Citric acid, volatile acidity and alcohol are among the variables most correlated with quality, while pH correlates with it only weakly (-0.058). I faceted the plots by rating to separate the scatterplots into three categories: bad, average, and good. Good wines tend to have higher citric acid and lower volatile acidity. Bad quality wines have lower levels of citric acid than average and good quality wines. This suggests that quality is related to the type of acid present.


Final Plots and Summary

Plot One

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

This shows the distribution of volatile acidity in the wines, which is roughly bell-shaped with a longer right tail. Half of the wines (the interquartile range) have volatile acidity between 0.39 and 0.64 g/dm^3.
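
To judge how far the tails extend, the quartiles in the summary above can be turned into outlier fences. This is a small sketch using Tukey's 1.5 * IQR rule, which is an assumption here and not necessarily the rule used for the plots.

```r
# Quartiles of volatile.acidity copied from the summary above
q1 <- 0.39
q3 <- 0.64
iqr <- q3 - q1

# Tukey's rule: points beyond 1.5 * IQR from the quartiles count as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

c(lower = lower_fence, upper = upper_fence)
```

The reported maximum of 1.58 lies well above the upper fence, while the minimum of 0.12 stays inside the lower one, so the unusual values are all on the high side.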

Plot Two

These boxplots show how different variables relate to the quality of wine. Good quality wines have lower volatile acidity and pH, and higher alcohol and citric acid. The outliers in each plot suggest that no single variable determines quality.

Plot Three

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"              "rating"               "total.acidity"

High quality wines have higher levels of citric acid and lower levels of volatile acidity than low quality wines.